This Rmarkdown script (together with the corresponding TAB-separated CSV input data file InfoRateData.csv and the resulting HTML document) contains the full analysis and plotting code accompanying the paper "Human languages share an optimal information transmission rate".
For more information on the data, please see Oh (2015). There are 17 languages in total (see the table below).
The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences carefully translated by a native speaker in each language.
For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible (e.g. Mandarin spoken in Beijing, Serbian in Belgrade and Korean in Seoul). No strict control on age or on the speakers’ social diversity was performed, but speakers were mainly students or members of academic institutions. Speakers were asked to read each text three times (first silently and then aloud twice). The texts were presented one by one on the screen in random order, in a self-paced reading paradigm. This way, speakers familiarized themselves with the text and reduced their reading errors. The second read-aloud recording was analyzed in this study.
Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:
Additionally, no syllabification was required for Sino-Tibetan languages (Cantonese and Mandarin Chinese) since one ideogram corresponds to one syllable.
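The one-character-one-syllable property can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `count_syllables_han` and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are assumptions made purely for illustration.

```python
def count_syllables_han(text: str) -> int:
    """Count syllables in Chinese text as the number of Han characters.

    Assumption: only the basic CJK Unified Ideographs block is checked;
    punctuation, digits, and Latin letters are ignored.
    """
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


# Four Han characters and one full-width comma.
example = "你好，世界"
n_syllables = count_syllables_han(example)
```

A real corpus would likely need extra care (rare characters outside this block, digits, foreign words), which this sketch deliberately leaves out.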
| Language | Family | ISO 639-3 | Corpus |
|---|---|---|---|
| Basque | Basque | EUS | E-Hitz (Perea et al., 2006) |
| British English | Indo-European | ENG | WebCelex (MPI for Psycholinguistics) |
| Cantonese | Sino-Tibetan | YUE | A linguistic corpus of mid-20th century Hong Kong Cantonese |
| Catalan | Indo-European | CAT | Frequency dictionary (Zséder et al., 2012) |
| Finnish | Uralic | FIN | Finnish Parole Corpus |
| French | Indo-European | FRA | Lexique 3.80 (New et al., 2001) |
| German | Indo-European | DEU | WebCelex (MPI for Psycholinguistics) |
| Hungarian | Uralic | HUN | Hungarian National Corpus (Váradi, 2002) |
| Italian | Indo-European | ITA | The Corpus PAISÀ (Lyding et al., 2014) |
| Japanese | Japanese | JPN | Japanese Internet Corpus (Sharoff, 2006) |
| Korean | Korean | KOR | Leipzig Corpora Collection (LCC) |
| Mandarin Chinese | Sino-Tibetan | CMN | Chinese Internet Corpus (Sharoff, 2006) |
| Serbian | Indo-European | SRP | Frequency dictionary (Zséder et al., 2012) |
| Spanish | Indo-European | SPA | Frequency dictionary (Zséder et al., 2012) |
| Thai | Tai-Kadai | THA | Thai National Corpus (TNC) |
| Turkish | Turkic | TUR | Leipzig Corpora Collection (LCC) |
| Vietnamese | Austroasiatic | VIE | VNSpeechCorpus (Le et al., 2004) |
The data is structured as follows:
| Lng | # spkrs | % fem | # spkrs with age info | mean(age) | sd(age) | actual ages |
|---|---|---|---|---|---|---|
| CAT | 10 | 50 | 10 | 35.4 | 9.2 | (21, 28, 28, 29, 31, 39, 42, 42, 44, 50) |
| CMN | 10 | 50 | 9 | 23.1 | 4.5 | (19, 19, 19, 19, 24, 24, 25, 28, 31) |
| DEU | 10 | 50 | 0 | NaN | NaN | () |
| ENG | 10 | 50 | 0 | NaN | NaN | () |
| EUS | 10 | 50 | 10 | 28.0 | 4.9 | (19, 22, 26, 27, 28, 29, 30, 31, 32, 36) |
| FIN | 10 | 50 | 10 | 33.2 | 11.0 | (16, 22, 26, 28, 30, 35, 37, 41, 45, 52) |
| FRA | 10 | 50 | 10 | 32.5 | 7.7 | (24, 25, 25, 27, 28, 36, 36, 37, 41, 46) |
| HUN | 10 | 50 | 10 | 39.3 | 15.8 | (17, 27, 27, 31, 33, 39, 42, 51, 57, 69) |
| ITA | 10 | 50 | 0 | NaN | NaN | () |
| JPN | 10 | 50 | 10 | 30.6 | 12.8 | (20, 20, 21, 22, 22, 28, 29, 40, 51, 53) |
| KOR | 10 | 50 | 10 | 28.6 | 10.6 | (16, 19, 19, 19, 28, 31, 33, 35, 36, 50) |
| SPA | 10 | 50 | 10 | 33.7 | 10.1 | (21, 22, 26, 28, 30, 32, 42, 42, 44, 50) |
| SRP | 10 | 50 | 10 | 30.6 | 7.8 | (19, 21, 23, 30, 31, 32, 34, 34, 38, 44) |
| THA | 10 | 50 | 10 | 30.1 | 5.7 | (23, 23, 27, 28, 30, 31, 31, 32, 33, 43) |
| TUR | 10 | 50 | 7 | 32.6 | 7.2 | (24, 25, 30, 31, 37, 37, 44) |
| VIE | 10 | 50 | 6 | 27.2 | 4.1 | (21, 25, 26, 28, 31, 32) |
| YUE | 10 | 50 | 10 | 22.0 | 1.5 | (20, 20, 21, 21, 22, 22, 23, 23, 24, 24) |
NS: exploratory plots.
mean=101.217, median=100, sd=24.694, CV=0.244, min=49, max=162, kurtosis=2.485, skewness=0.178.
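The summary lines in this report all follow the same layout. A hedged Python sketch of how such a line could be computed is given below; it is not the original R code. We assume that sd is the sample standard deviation, that CV is sd divided by the mean (which is consistent with the values above), and that skewness and kurtosis are moment-based with kurtosis not reduced by 3. The function name `describe` and the toy data are hypothetical.

```python
import statistics


def describe(values):
    """Return summary statistics in the layout used in this report.

    Assumptions: sd uses the n - 1 denominator; skewness and kurtosis are
    moment-based, and kurtosis is NOT reduced by 3 (normal data give ~3).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    # Central moments with an n denominator.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "CV": sd / mean,  # coefficient of variation
        "min": min(values),
        "max": max(values),
        "kurtosis": m4 / m2 ** 2,
        "skewness": m3 / m2 ** 1.5,
    }


# Made-up syllable counts, for illustration only (not the study's data).
toy_ns = [80, 95, 100, 100, 110, 125]
summary = describe(toy_ns)

# The CV reported for NS follows directly from its reported sd and mean.
cv_ns = 24.694 / 101.217
```

Rounded to three decimals, `cv_ns` reproduces the CV printed for NS.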
SR: exploratory plots.
SR per speaker.
SR by Sex and Age across Languages.
SR by Sex, Age and Language.
SR by language.
mean=6.612, median=6.758, sd=1.133, CV=0.171, min=3.589, max=9.384, kurtosis=2.381, skewness=-0.197.
ShE and ID: exploratory plots.
ShE vs ID.
Pearson's product-moment correlation
data: tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02052914 0.77274779
sample estimates:
cor
0.4647009
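As a cross-check, the t statistic and the confidence interval above can be recovered from the correlation estimate and the number of languages alone. The Python sketch below uses the standard formulas (t from r with n - 2 degrees of freedom, and a Fisher z interval); the function name `pearson_t_and_ci` is hypothetical, and the sketch is not the code that produced the output.

```python
import math
from statistics import NormalDist


def pearson_t_and_ci(r: float, n: int, level: float = 0.95):
    """t statistic and Fisher-z confidence interval for a Pearson correlation."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    # Fisher transformation: atanh(r) is approximately normal with
    # standard error 1 / sqrt(n - 3).
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n - 3)
    return t, df, (math.tanh(z - half_width), math.tanh(z + half_width))


# r is copied from the output above; n = 17 languages.
t_stat, df, (ci_low, ci_high) = pearson_t_and_ci(0.4647009, 17)
```

With only 17 data points the interval is wide and includes zero, which matches the non-significant p-value.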
Spearman's rank correlation rho
data: tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4462208
Paired t-test
data: tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.158040 3.119607
sample estimates:
mean of the differences
2.638824
ShE:
mean=8.619, median=8.69, sd=0.906, CV=0.105, min=6.07, max=9.83, kurtosis=4.662, skewness=-1.124.
ID:
mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.521, skewness=0.741.
ShIR: exploratory plots.
ShIR per speaker.
ShIR by Sex and Age across Languages.
ShIR by Sex, Age and Language.
ShIR by language.
IR: exploratory plots.
IR per speaker.
IR by Sex and Age across Languages.
IR by Sex, Age and Language.
IR by language.
ShIR:
mean=56.525, median=56.923, sd=9.217, CV=0.163, min=32.772, max=83.042, kurtosis=2.335, skewness=0.048.
IR:
mean=39.03, median=39.06, sd=4.944, CV=0.127, min=25.631, max=56.309, kurtosis=3.424, skewness=0.242.
SR vs ID
Pearson's product-moment correlation
data: info.rate.data$SR and info.rate.data$ID
t = -46.642, df = 2263, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7205100 -0.6784761
sample estimates:
cor
-0.7000991
Spearman's rank correlation rho
data: info.rate.data$SR and info.rate.data$ID
S = 3315800000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.7121454
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.30 |
| Language | 0.54 |
| Speaker | 0.00 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.01 |
| Language | 0.62 |
| Speaker | 0.29 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.01 |
| Language | 0.56 |
| Speaker | 0.34 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.02 |
| Language | 0.36 |
| Speaker | 0.49 |
| model | AIC | BIC |
|---|---|---|
| 1 + (1 \| Text) + (1 \| Language) + (1 \| Speaker) | 1997.71 | 2026.34 |
| 1 + (1 \| Language) + (1 \| Speaker) | 2269.46 | 2292.36 |
| 1 + (1 \| Text) + (1 \| Speaker) | 2131.65 | 2154.55 |
| 1 + (1 \| Text) + (1 \| Language) | 4491.18 | 4514.08 |
| 1 + Sex + (1 \| Text) + (1 \| Language) + (1 \| Speaker) | 1990.24 | 2024.59 |
| 1 + Sex + (1 \| Language) + (1 \| Speaker) | 2261.83 | 2290.46 |
| 1 + Sex + (1 \| Text) + (1 \| Speaker) | 2131.41 | 2160.04 |
| 1 + Sex + (1 \| Text) + (1 \| Language) | 4364.41 | 4393.04 |
We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).
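The choice of the full model can be retraced from the AIC and BIC values in the SR table. The Python sketch below (not the original R code) ranks the candidate models and checks the textbook relation BIC - AIC = k (ln n - 2). The parameter counts `k` are our assumption (fixed effects, plus one variance per random intercept, plus the residual variance), and n = 2265 is the number of observations reported in the model summaries.

```python
import math

# (model formula, AIC, BIC, number of estimated parameters k).
# AIC and BIC are copied from the SR table; k is an ASSUMPTION:
# fixed effects + one variance per random intercept + residual variance.
sr_models = [
    ("1 + (1|Text) + (1|Language) + (1|Speaker)", 1997.71, 2026.34, 5),
    ("1 + (1|Language) + (1|Speaker)", 2269.46, 2292.36, 4),
    ("1 + (1|Text) + (1|Speaker)", 2131.65, 2154.55, 4),
    ("1 + (1|Text) + (1|Language)", 4491.18, 4514.08, 4),
    ("1 + Sex + (1|Text) + (1|Language) + (1|Speaker)", 1990.24, 2024.59, 6),
    ("1 + Sex + (1|Language) + (1|Speaker)", 2261.83, 2290.46, 5),
    ("1 + Sex + (1|Text) + (1|Speaker)", 2131.41, 2160.04, 5),
    ("1 + Sex + (1|Text) + (1|Language)", 4364.41, 4393.04, 5),
]

N_OBS = 2265  # number of observations reported in the model summaries

best_by_aic = min(sr_models, key=lambda m: m[1])[0]
best_by_bic = min(sr_models, key=lambda m: m[2])[0]

# Since AIC = -2 logLik + 2k and BIC = -2 logLik + k ln(n),
# BIC - AIC should equal k (ln(n) - 2) up to rounding in the table.
penalty_gaps = [
    (bic - aic, k * (math.log(N_OBS) - 2.0)) for _, aic, bic, k in sr_models
]
```

Both criteria point to the model with Sex and all three random intercepts, and dropping Speaker is by far the most costly simplification.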
| model | AIC | BIC |
|---|---|---|
| 1 + (1 \| Text) + (1 \| Language) + (1 \| Speaker) | 10040.96 | 10069.59 |
| 1 + (1 \| Language) + (1 \| Speaker) | 10307.46 | 10330.36 |
| 1 + (1 \| Text) + (1 \| Speaker) | 10093.03 | 10115.93 |
| 1 + (1 \| Text) + (1 \| Language) | 12574.31 | 12597.21 |
| 1 + Sex + (1 \| Text) + (1 \| Language) + (1 \| Speaker) | 10029.47 | 10063.82 |
| 1 + Sex + (1 \| Language) + (1 \| Speaker) | 10295.8 | 10324.43 |
| 1 + Sex + (1 \| Text) + (1 \| Speaker) | 10086.29 | 10114.91 |
| 1 + Sex + (1 \| Text) + (1 \| Language) | 12439.91 | 12468.54 |
We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).
We will use a Gaussian distribution (with fixed or modelled variance).
******************************************************************
Summary of the Quantile Residuals
mean = 6.730826e-05
variance = 1.000442
coef. of skewness = -0.05533865
coef. of kurtosis = 3.321179
Filliben correlation coefficient = 0.9990828
******************************************************************
Deviance= 1048.133
AIC= 1447.224
******************************************************************
Summary of the Quantile Residuals
mean = 0.00173847
variance = 1.000438
coef. of skewness = -0.02188322
coef. of kurtosis = 2.7558
Filliben correlation coefficient = 0.9991323
******************************************************************
Deviance= 718.4239
AIC= 1298.507
The distribution of the residuals is less heteroscedastic than before and the fit to the data is better (AIC 1298.507 versus 1447.224). The full summary of the model is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"),
data = d, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.453045 0.007615 847.38 <2e-16 ***
SexM 0.320518 0.011484 27.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.30980 0.02098 -62.432 < 2e-16 ***
SexM 0.11215 0.02972 3.774 0.000165 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 290.0417
Residual Deg. of Freedom: 1974.958
at cycle: 49
Global Deviance: 718.4239
AIC: 1298.507
SBC: 2959.092
******************************************************************
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.39916
Random effect parameter sigma_b: 0.10842
Smoothing parameter lambda : 85.615
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 16.9836
Random effect parameter sigma_b: 0.870522
Smoothing parameter lambda : 1.32957
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 165.2788
Random effect parameter sigma_b: 0.55531
Smoothing parameter lambda : 3.49812
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 0.4520879
Random effect parameter sigma_b: 0.00945165
Smoothing parameter lambda : 9841.95
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.92063
Random effect parameter sigma_b: 0.161583
Smoothing parameter lambda : 33.8914
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 74.00739
Random effect parameter sigma_b: 0.165399
Smoothing parameter lambda : 33.2177
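The fit indices in the SR model summary are tied together by simple penalised-deviance formulas. As a sanity check, the Python sketch below (not the original R code) recomputes AIC, SBC, and the residual degrees of freedom from the reported global deviance and the effective degrees of freedom.

```python
import math

# Values copied from the SR model summary above.
n_obs = 2265
df_fit = 290.0417
global_deviance = 718.4239

# AIC penalises each effective degree of freedom by 2;
# SBC (Schwarz Bayesian criterion) penalises it by ln(n).
aic = global_deviance + 2.0 * df_fit
sbc = global_deviance + math.log(n_obs) * df_fit
residual_df = n_obs - df_fit
```

Because the random effects use many effective degrees of freedom (about 290 here), SBC penalises these models far more heavily than AIC does.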
******************************************************************
Summary of the Quantile Residuals
mean = 0.0002544616
variance = 1.000442
coef. of skewness = 0.01305823
coef. of kurtosis = 3.360931
Filliben correlation coefficient = 0.9990575
******************************************************************
Deviance= 9108.233
AIC= 9507.395
******************************************************************
Summary of the Quantile Residuals
mean = 0.001908036
variance = 1.000449
coef. of skewness = -0.02180224
coef. of kurtosis = 2.755167
Filliben correlation coefficient = 0.999126
******************************************************************
Deviance= 8782.485
AIC= 9364.281
Again, this is a better fit to the data (AIC 9364.281 versus 9507.395). The full summary of the model is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"),
data = d, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.0868 0.0452 842.63 <2e-16 ***
SexM 1.8954 0.0685 27.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47195 0.02098 22.496 < 2e-16 ***
SexM 0.11387 0.02972 3.832 0.000131 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 290.8981
Residual Deg. of Freedom: 1974.102
at cycle: 7
Global Deviance: 8782.485
AIC: 9364.281
SBC: 11029.77
******************************************************************
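Because sigma is modelled on the log scale, its coefficients are easier to read after back-transformation. The Python sketch below is only illustrative: it assumes that the reference level of Sex is female (the coefficient is labelled SexM) and it ignores the random-effect terms in the sigma formula, so the values are baseline standard deviations rather than fitted ones.

```python
import math

# Sigma coefficients copied from the IR model summary above (log link).
sigma_intercept = 0.47195
sigma_sex_m = 0.11387

# Back-transform from the log scale. These are baseline values only:
# the random-effect terms in the sigma formula are ignored here.
sigma_female = math.exp(sigma_intercept)
sigma_male = math.exp(sigma_intercept + sigma_sex_m)
male_to_female_ratio = sigma_male / sigma_female  # equals exp(sigma_sex_m)
```

Under these assumptions the residual standard deviation of IR is roughly 12% larger for male speakers than for female speakers.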
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.42023
Random effect parameter sigma_b: 0.658146
Smoothing parameter lambda : 2.32345
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 16.9518
Random effect parameter sigma_b: 3.06993
Smoothing parameter lambda : 0.10691
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 165.1512
Random effect parameter sigma_b: 3.31501
Smoothing parameter lambda : 0.0981557
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 0.1016434
Random effect parameter sigma_b: 0.00435605
Smoothing parameter lambda : 46312
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.96861
Random effect parameter sigma_b: 0.163755
Smoothing parameter lambda : 32.9875
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 75.30454
Random effect parameter sigma_b: 0.168083
Smoothing parameter lambda : 32.1735
Let’s model SR with ID as an additional predictor (fixed effect). N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + ID + Sex + random(Text) + random(Speaker), sigma.formula = ~1 + ID + Sex + random(Text) + random(Speaker), family = NO(mu.link = "identity"), data = d, control = gamlss.control(n.cyc = 800,
trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.79082 0.03676 320.71 <2e-16 ***
ID -0.88773 0.00584 -152.02 <2e-16 ***
SexM 0.31151 0.01138 27.37 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.79830 0.10475 -7.621 3.88e-14 ***
ID -0.08666 0.01705 -5.084 4.04e-07 ***
SexM 0.10976 0.02972 3.694 0.000227 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 293.2313
Residual Deg. of Freedom: 1971.769
at cycle: 20
Global Deviance: 686.0572
AIC: 1272.52
SBC: 2951.366
******************************************************************
******************************************************************
Summary of the Quantile Residuals
mean = 0.001886609
variance = 1.000438
coef. of skewness = -0.02395902
coef. of kurtosis = 2.675602
Filliben correlation coefficient = 0.998938
******************************************************************
Deviance= 686.0572
AIC= 1272.52
Adding ID as a predictor improves the fit (as judged by AIC: 1272.52 versus 1298.507 without ID). The estimate for ID is negative, but its significance is difficult to assess in GAMLSS models involving smoothing functions. However, a simple lmer model also shows a significant effect of ID:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
ID 16.4166 16.4166 1 166.56 162.630 < 2.2e-16 ***
Sex 0.8059 0.8059 1 166.64 7.984 0.005296 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
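To make the size of the negative ID effect concrete, the fixed-effects part of the GAMLSS fit of SR can be evaluated by hand. The Python sketch below is an illustration rather than the authors' code: the function name `predicted_sr` is hypothetical, and the random effects of Text and Speaker are ignored.

```python
# Mu coefficients copied from the GAMLSS fit of SR with ID as a predictor.
INTERCEPT = 11.79082
SLOPE_ID = -0.88773
EFFECT_SEX_M = 0.31151


def predicted_sr(info_density: float, male: bool = False) -> float:
    """Fixed-effects prediction of SR, ignoring Text and Speaker effects."""
    return INTERCEPT + SLOPE_ID * info_density + (EFFECT_SEX_M if male else 0.0)


# Lowest and highest ID values from the exploratory summary (4.83 and 8.02).
sr_at_min_id = predicted_sr(4.83)
sr_at_max_id = predicted_sr(8.02)
```

Across the observed range of ID, the predicted SR for the reference level of Sex drops from about 7.5 to about 4.7.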
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.42494
Random effect parameter sigma_b: 0.109972
Smoothing parameter lambda : 83.2162
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 167.4871
Random effect parameter sigma_b: 0.745948
Smoothing parameter lambda : 1.94065
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 1.725638
Random effect parameter sigma_b: 0.0190072
Smoothing parameter lambda : 2323.82
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 103.5937
Random effect parameter sigma_b: 0.234395
Smoothing parameter lambda : 16.0007
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ Age * Sex + (1 | Text) + (1 | Language)
Data: info.rate.data
REML criterion at convergence: 3542.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.8928 -0.6374 0.0099 0.6180 3.2178
Random effects:
Groups Name Variance Std.Dev.
Text (Intercept) 0.01204 0.1097
Language (Intercept) 1.01283 1.0064
Residual 0.33378 0.5777
Number of obs: 1959, groups: Text, 15; Language, 14
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.662e+00 2.791e-01 1.499e+01 23.873 2.44e-13 ***
Age -6.214e-03 2.177e-03 1.930e+03 -2.855 0.00436 **
SexM 1.809e-01 8.875e-02 1.928e+03 2.038 0.04165 *
Age:SexM 3.061e-03 2.771e-03 1.928e+03 1.105 0.26934
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Age SexM
Age -0.237
SexM -0.176 0.682
Age:SexM 0.170 -0.716 -0.956
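The variance components printed in the "Random effects" block above can be turned into shares of the total variance, which is one common way to read such a table. The Python sketch below is a hedged illustration of that arithmetic; the resulting shares need not match the ICC tables earlier in this report, which may have been computed from different models.

```python
# Variance components copied from the "Random effects" block above.
variance_components = {
    "Text": 0.01204,
    "Language": 1.01283,
    "Residual": 0.33378,
}

total_variance = sum(variance_components.values())

# Share of the total (random + residual) variance per grouping factor.
variance_share = {
    name: value / total_variance for name, value in variance_components.items()
}
```

In this fit, Language accounts for roughly three quarters of the variance in SR, Text for less than 1%, and the rest is residual (which here also contains the between-speaker variation, since Speaker is not a random effect in this model).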
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: IR ~ Age * Sex + (1 | Text) + (1 | Language)
Data: info.rate.data
REML criterion at convergence: 10507.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.8281 -0.6366 0.0021 0.6134 3.5322
Random effects:
Groups Name Variance Std.Dev.
Text (Intercept) 0.3903 0.6247
Language (Intercept) 10.1823 3.1910
Residual 11.8771 3.4463
Number of obs: 1959, groups: Text, 15; Language, 14
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 39.33036 0.96006 20.47163 40.966 < 2e-16 ***
Age -0.04213 0.01298 1933.80587 -3.246 0.00119 **
SexM 1.08647 0.52933 1929.49454 2.053 0.04025 *
Age:SexM 0.01913 0.01652 1929.61240 1.158 0.24705
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Age SexM
Age -0.411
SexM -0.305 0.682
Age:SexM 0.295 -0.716 -0.956
So, it seems Age and Sex are both worth including in our models (even if we have to discard quite a bit of data because of missing Age info). In fact, the effect of Age seems more significant than that of Sex.
In the following, we investigate if Age does matter when using GAMLSS modelling…
Because there is missing data for Age, and because the GAMLSS models require no missing data, we fit the models with Age (and its interaction with Sex) on the subset of the data that contains only those speakers with Age info. To make the comparison possible, we also fit the same models without Age on the exact same subset of the data.
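The size of this subset can be cross-checked against the speaker table near the top of this report. The Python sketch below simply tallies the speakers with known age per language; it is an illustration, not part of the original R script.

```python
# Number of speakers with known age per language, copied from the column
# of the speaker table that counts speakers with age information.
speakers_with_age = {
    "CAT": 10, "CMN": 9, "DEU": 0, "ENG": 0, "EUS": 10, "FIN": 10,
    "FRA": 10, "HUN": 10, "ITA": 0, "JPN": 10, "KOR": 10, "SPA": 10,
    "SRP": 10, "THA": 10, "TUR": 7, "VIE": 6, "YUE": 10,
}

# Languages that keep at least one speaker once missing ages are dropped.
languages_kept = sorted(
    lang for lang, count in speakers_with_age.items() if count > 0
)
n_speakers_kept = sum(speakers_with_age.values())
```

The 14 remaining languages agree with the "Language, 14" group count in the lmer outputs above; German, English, and Italian drop out entirely because no ages were recorded for them.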
******************************************************************
Summary of the Quantile Residuals
mean = -6.753434e-05
variance = 1.000511
coef. of skewness = -0.077143
coef. of kurtosis = 3.389036
Filliben correlation coefficient = 0.9987482
******************************************************************
The model including Age * Sex is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + Sex * Age + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"), data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,
trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.611389 0.033550 197.061 < 2e-16 ***
SexM 0.165485 0.045834 3.611 0.000314 ***
Age -0.003173 0.001050 -3.024 0.002534 **
SexM:Age 0.003532 0.001428 2.473 0.013492 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.18831 0.01598 -74.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 1959
Degrees of Freedom for the fit: 161.6168
Residual Deg. of Freedom: 1797.383
at cycle: 33
Global Deviance: 903.5859
AIC: 1226.819
SBC: 2128.672
******************************************************************
The compared models are:
| Model | Deviance | AIC |
|---|---|---|
| Age * Sex | 903.6 | 1226.8 |
| Age + Sex | 903.6 | 1224.8 |
| Sex | 903.6 | 1222.8 |
So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted here…
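One way to quantify "not warranted" is to convert the AIC values in the table above into AIC differences and Akaike weights. This is an additional illustration on our part and not a step of the original analysis; the Python sketch below uses only the three AIC values from the table.

```python
import math

# AIC values copied from the comparison table above.
aic_by_model = {"Age * Sex": 1226.8, "Age + Sex": 1224.8, "Sex": 1222.8}

best_aic = min(aic_by_model.values())
delta_aic = {model: aic - best_aic for model, aic in aic_by_model.items()}

# Akaike weights: relative support for each model within this candidate set.
raw_support = {model: math.exp(-delta / 2.0) for model, delta in delta_aic.items()}
total_support = sum(raw_support.values())
akaike_weight = {model: s / total_support for model, s in raw_support.items()}
```

Since the deviance is the same for all three models, each additional term only adds its penalty of 2 to the AIC, so the Sex-only model receives the largest weight.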
******************************************************************
Summary of the Quantile Residuals
mean = 0.002281433
variance = 1.000505
coef. of skewness = -0.03450854
coef. of kurtosis = 2.771259
Filliben correlation coefficient = 0.9990369
******************************************************************
The model including Age * Sex is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + Sex * Age + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex * Age + random(Text) + random(Language) + random(Speaker),
family = NO(mu.link = "identity"), data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6340667 0.0269728 245.954 < 2e-16 ***
SexM 0.2124780 0.0400092 5.311 1.23e-07 ***
Age -0.0039802 0.0008482 -4.693 2.91e-06 ***
SexM:Age 0.0019671 0.0012747 1.543 0.123
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.382823 0.080515 -17.175 <2e-16 ***
SexM 0.040345 0.108365 0.372 0.710
Age 0.002119 0.002524 0.839 0.401
SexM:Age 0.002442 0.003380 0.723 0.470
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 1959
Degrees of Freedom for the fit: 241.9192
Residual Deg. of Freedom: 1717.081
at cycle: 89
Global Deviance: 626.5772
AIC: 1110.416
SBC: 2460.371
******************************************************************
The compared models are:
| Model | Deviance | AIC |
|---|---|---|
| Age * Sex | 626.6 | 1110.4 |
| Age + Sex | 627 | 1106.6 |
| Sex | 626.5 | 1103.6 |
So, even if Age has a significant (negative) effect (but no interaction with Sex), adding it does not seem to be warranted here either…
The distribution of the residuals is less heteroscedastic than before and the fit to the data is better.
Thus, for SR, even if there is a hint that Age might affect it negatively (and there might also be an interaction with Sex with a positive effect for males), overall, the various fit indices do not warrant its inclusion in the GAMLSS models.
******************************************************************
Summary of the Quantile Residuals
mean = 0.0001754402
variance = 1.000511
coef. of skewness = -0.01921085
coef. of kurtosis = 3.370712
Filliben correlation coefficient = 0.9989935
******************************************************************
The model including Age * Sex is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = IR ~ 1 + Sex * Age + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"), data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,
trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.622244 0.196111 202.040 < 2e-16 ***
SexM 0.081767 0.267917 0.305 0.76
Age -0.053245 0.006135 -8.679 < 2e-16 ***
SexM:Age 0.050748 0.008348 6.079 1.47e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.57733 0.01598 36.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 1959
Degrees of Freedom for the fit: 161.7827
Residual Deg. of Freedom: 1797.217
at cycle: 3
Global Deviance: 7821.387
AIC: 8144.953
SBC: 9047.731
******************************************************************
The compared models are:
| Model | Deviance | AIC |
|---|---|---|
| Age * Sex | 7821.4 | 8145 |
| Age + Sex | 7821.4 | 8142.9 |
| Sex | 7821.4 | 8141 |
So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted. Interestingly, in this case the main effect of Sex disappears.
******************************************************************
Summary of the Quantile Residuals
mean = 0.001954218
variance = 1.000561
coef. of skewness = -0.03439158
coef. of kurtosis = 2.780556
Filliben correlation coefficient = 0.9990362
******************************************************************
The model including Age * Sex is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = IR ~ 1 + Sex * Age + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex * Age + random(Text) + random(Language) + random(Speaker),
family = NO(mu.link = "identity"), data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.588880 0.157737 250.981 < 2e-16 ***
SexM 0.178796 0.233398 0.766 0.444
Age -0.052356 0.004871 -10.748 < 2e-16 ***
SexM:Age 0.047628 0.007308 6.517 9.4e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4322418 0.0804980 5.370 8.97e-08 ***
SexM 0.0243379 0.1084601 0.224 0.822
Age 0.0007875 0.0025236 0.312 0.755
SexM:Age 0.0029412 0.0033831 0.869 0.385
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 1959
Degrees of Freedom for the fit: 242.042
Residual Deg. of Freedom: 1716.958
at cycle: 5
Global Deviance: 7558.83
AIC: 8042.914
SBC: 9393.554
******************************************************************
The compared models are:
| Model | Deviance | AIC |
|---|---|---|
| Age * Sex | 7558.8 | 8042.9 |
| Age + Sex | 7559.3 | 8039.2 |
| Sex | 7558.4 | 8035.9 |
So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted. Interestingly, in this case the main effect of Sex disappears.
The distribution of the residuals is less heteroscedastic than before and the fit to the data is better.
Thus, while for IR the hint that Age has a negative main effect and interacts with Sex (with a positive effect for males, absorbing the whole effect of Sex) is much stronger, the various fit indices do not warrant its inclusion in the GAMLSS models.
******************************************************************
Summary of the Quantile Residuals
mean = 0.001628
variance = 1.000508
coef. of skewness = -0.04541686
coef. of kurtosis = 2.714878
Filliben correlation coefficient = 0.9989143
******************************************************************
The model including Age * Sex is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + ID + Sex * Age + random(Text) + random(Speaker), sigma.formula = ~1 + ID + Sex * Age + random(Text) + random(Speaker), family = NO(mu.link = "identity"),
data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.6733395 0.0540976 234.268 <2e-16 ***
ID -0.9842800 0.0069961 -140.690 <2e-16 ***
SexM -0.0685700 0.0401577 -1.708 0.0879 .
Age -0.0101418 0.0008598 -11.796 <2e-16 ***
SexM:Age 0.0109266 0.0012848 8.505 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7728533 0.1464453 -5.277 1.48e-07 ***
ID -0.0935798 0.0192603 -4.859 1.29e-06 ***
SexM -0.0100200 0.1092224 -0.092 0.927
Age 0.0001954 0.0025381 0.077 0.939
SexM:Age 0.0041006 0.0034092 1.203 0.229
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 1959
Degrees of Freedom for the fit: 237.2253
Residual Deg. of Freedom: 1721.775
at cycle: 9
Global Deviance: 605.9802
AIC: 1080.431
SBC: 2404.193
******************************************************************
The compared models are:
| Model | Deviance | AIC |
|---|---|---|
| Age * Sex | 606 | 1080.4 |
| Age + Sex | 606.2 | 1076.8 |
| Sex | 606.2 | 1073.1 |
Clearly, adding Age is not warranted here…
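The AIC and SBC printed in the GAMLSS summaries follow directly from the global deviance and the (effective) degrees of freedom of the fit. As a minimal sketch in base R (the helper name `ic_from_deviance` is ours for illustration and is not part of gamlss), the values reported above for the Age * Sex model can be recovered, up to rounding, like this:

```r
# Hypothetical helper (not part of gamlss): information criteria computed from
# the global deviance, the effective degrees of freedom and the sample size.
ic_from_deviance <- function(deviance, df, n) {
  c(AIC = deviance + 2 * df,        # penalty of 2 per degree of freedom
    SBC = deviance + log(n) * df)   # penalty of log(n) per degree of freedom
}

# Values copied from the summary of the SR ~ ID + Sex * Age model above
ic <- ic_from_deviance(deviance = 605.9802, df = 237.2253, n = 1959)
print(round(ic, 3))
```

Because the three compared models have almost identical deviances, the penalty term dominates the comparison, which is why the model with the fewest degrees of freedom ends up with the lowest AIC.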
As above, we also looked at the simple lmer model:
The compared models are:
| Model | AIC |
|---|---|
| Age * Sex | 1699.9 |
| Age + Sex | 1691.7 |
| Sex | 1682 |
Interestingly, here the best model (as suggested by AIC) is the one with both Age and Sex (but no interaction), but, as can be seen below, the effect of Age is actually not significant (p = 0.482), suggesting that, in fact, the best model is still the one not including Age:
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID + Sex + Age + (1 | Text) + (1 | Speaker)
Data: info.rate.data.for.age
REML criterion at convergence: 1677.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.9334 -0.6415 0.0292 0.5987 3.0159
Random effects:
Groups Name Variance Std.Dev.
Speaker (Intercept) 0.50569 0.7111
Text (Intercept) 0.01464 0.1210
Residual 0.10010 0.3164
Number of obs: 1959, groups: Speaker, 132; Text, 15
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 12.458339 0.526992 128.806967 23.640 <2e-16 ***
ID -0.976409 0.075150 127.928957 -12.993 <2e-16 ***
SexM 0.271988 0.124632 127.921247 2.182 0.0309 *
Age -0.004613 0.006545 127.903978 -0.705 0.4822
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) ID SexM
ID -0.910
SexM -0.114 -0.005
Age -0.521 0.167 -0.005
Within the limits of our reduced dataset (containing only 132 speakers with Age info), we found the following:
When modelling SR and IR with GAMLSS, there are hints that Age has, for both, an overall negative main effect and an interaction with Sex (positive for males), but it does not seem warranted to include it in these models.
When modelling the relationship between SR and ID, this negative relationship seems to be strengthened by Age, but, alas, the inclusion of Age is not warranted in the GAMLSS model, nor (really) in the simpler LMER model.
Thus, while Age seems to negatively influence (in a sex-dependent manner) both SR and IR, as well as to strengthen the negative relationship between SR and ID, its effects are far from clear in the current dataset.
In what follows, mixing probabilities are independent from factors such as Sex.
Between 1 and 5 Gaussian distributions:
1 component
Mixing Family: "NO"
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
6.612
Sigma Coefficients for model: 1
(Intercept)
0.1249
Estimated probabilities: 1
Degrees of Freedom for the fit: 2 Residual Deg. of Freedom 2263
Global Deviance: 6993.5
AIC: 6997.5
SBC: 7008.95
2 components
Mixing Family: c("NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
5.293
Sigma Coefficients for model: 1
(Intercept)
-0.4556
Mu Coefficients for model: 2
(Intercept)
7.157
Sigma Coefficients for model: 2
(Intercept)
-0.2297
Estimated probabilities: 0.2925181 0.7074819
Degrees of Freedom for the fit: 5 Residual Deg. of Freedom 2260
Global Deviance: 6862.11
AIC: 6872.11
SBC: 6900.74
3 components
Mixing Family: c("NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
5.23
Sigma Coefficients for model: 1
(Intercept)
-0.4891
Mu Coefficients for model: 2
(Intercept)
6.612
Sigma Coefficients for model: 2
(Intercept)
-0.1509
Mu Coefficients for model: 3
(Intercept)
7.271
Sigma Coefficients for model: 3
(Intercept)
-0.263
Estimated probabilities: 0.2538 0.2146865 0.5315135
Degrees of Freedom for the fit: 8 Residual Deg. of Freedom 2257
Global Deviance: 6862.26
AIC: 6878.26
SBC: 6924.07
4 components
Mixing Family: c("NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
7.164
Sigma Coefficients for model: 1
(Intercept)
-0.2973
Mu Coefficients for model: 2
(Intercept)
5.085
Sigma Coefficients for model: 2
(Intercept)
-0.5514
Mu Coefficients for model: 3
(Intercept)
5.79
Sigma Coefficients for model: 3
(Intercept)
-0.4671
Mu Coefficients for model: 4
(Intercept)
7.321
Sigma Coefficients for model: 4
(Intercept)
-0.2167
Estimated probabilities: 0.4458113 0.1825739 0.1505794 0.2210354
Degrees of Freedom for the fit: 11 Residual Deg. of Freedom 2254
Global Deviance: 6861.2
AIC: 6883.2
SBC: 6946.18
5 components
Mixing Family: c("NO", "NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
6.375
Sigma Coefficients for model: 1
(Intercept)
-0.3489
Mu Coefficients for model: 2
(Intercept)
7.229
Sigma Coefficients for model: 2
(Intercept)
-1.002
Mu Coefficients for model: 3
(Intercept)
7.971
Sigma Coefficients for model: 3
(Intercept)
-0.5582
Mu Coefficients for model: 4
(Intercept)
6.249
Sigma Coefficients for model: 4
(Intercept)
-0.4571
Mu Coefficients for model: 5
(Intercept)
5.04
Sigma Coefficients for model: 5
(Intercept)
-0.6053
Estimated probabilities: 0.1790673 0.2320782 0.1988533 0.2003551 0.1896461
Degrees of Freedom for the fit: 14 Residual Deg. of Freedom 2251
Global Deviance: 6844.33
AIC: 6872.33
SBC: 6952.48
Comparing AIC
df AIC
mix.SR.NO.2 5 6872.115
mix.SR.NO.5 14 6872.329
mix.SR.NO.3 8 6878.264
mix.SR.NO.4 11 6883.205
mix.SR.NO.1 2 6997.498
Showing the distributions
Mixture of Gaussians for SR.
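To read the mixture output above, note that the Sigma coefficients can be negative, which suggests they are reported on the scale of the link function. The sketch below assumes the default log link for sigma in the NO family (an assumption on our part, not stated in the output) and uses only base R; it back-transforms the two-component SR fit and builds the corresponding mixture density.

```r
# Two-component fit for SR, values copied from the output above
mu        <- c(5.293, 7.157)
log_sigma <- c(-0.4556, -0.2297)   # ASSUMED to be on the log-link scale
prob      <- c(0.2925181, 0.7074819)

sigma <- exp(log_sigma)            # back-transform to standard deviations

# Mixture density: probability-weighted sum of the component Gaussians
dmix <- function(x) {
  vapply(x, function(xi) sum(prob * dnorm(xi, mean = mu, sd = sigma)),
         numeric(1))
}

mix_mean <- sum(prob * mu)                  # overall mean of the mixture
total    <- integrate(dmix, 0, 15)$value    # range covers essentially all mass
print(round(c(sigma = sigma, mean = mix_mean, integral = total), 3))
```

As a quick consistency check, the mean of this mixture is close to the intercept of the one-component fit (6.612), as it should be if both models describe the same data.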
Between 1 and 5 Gaussian distributions:
1 component
Mixing Family: "NO"
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
39.03
Sigma Coefficients for model: 1
(Intercept)
1.598
Estimated probabilities: 1
Degrees of Freedom for the fit: 2 Residual Deg. of Freedom 2263
Global Deviance: 13666.9
AIC: 13670.9
SBC: 13682.3
2 components
Mixing Family: c("NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
39.76
Sigma Coefficients for model: 1
(Intercept)
1.766
Mu Coefficients for model: 2
(Intercept)
38.36
Sigma Coefficients for model: 2
(Intercept)
1.334
Estimated probabilities: 0.4807269 0.5192731
Degrees of Freedom for the fit: 5 Residual Deg. of Freedom 2260
Global Deviance: 13637.4
AIC: 13647.4
SBC: 13676.1
3 components
Mixing Family: c("NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
35.42
Sigma Coefficients for model: 1
(Intercept)
1.339
Mu Coefficients for model: 2
(Intercept)
39.77
Sigma Coefficients for model: 2
(Intercept)
0.8876
Mu Coefficients for model: 3
(Intercept)
42.22
Sigma Coefficients for model: 3
(Intercept)
1.636
Estimated probabilities: 0.3631959 0.2936198 0.3431843
Degrees of Freedom for the fit: 8 Residual Deg. of Freedom 2257
Global Deviance: 13618
AIC: 13634
SBC: 13679.8
4 components
Mixing Family: c("NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
42.31
Sigma Coefficients for model: 1
(Intercept)
1.7
Mu Coefficients for model: 2
(Intercept)
39.42
Sigma Coefficients for model: 2
(Intercept)
0.6282
Mu Coefficients for model: 3
(Intercept)
40.07
Sigma Coefficients for model: 3
(Intercept)
1.316
Mu Coefficients for model: 4
(Intercept)
34.48
Sigma Coefficients for model: 4
(Intercept)
1.25
Estimated probabilities: 0.2568725 0.1723648 0.3018819 0.2688808
Degrees of Freedom for the fit: 11 Residual Deg. of Freedom 2254
Global Deviance: 13613.3
AIC: 13635.3
SBC: 13698.2
5 components
Mixing Family: c("NO", "NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
42.85
Sigma Coefficients for model: 1
(Intercept)
1.694
Mu Coefficients for model: 2
(Intercept)
37.37
Sigma Coefficients for model: 2
(Intercept)
1.456
Mu Coefficients for model: 3
(Intercept)
39.57
Sigma Coefficients for model: 3
(Intercept)
0.4153
Mu Coefficients for model: 4
(Intercept)
40.57
Sigma Coefficients for model: 4
(Intercept)
1.307
Mu Coefficients for model: 5
(Intercept)
35.26
Sigma Coefficients for model: 5
(Intercept)
1.339
Estimated probabilities: 0.2127589 0.2188109 0.1190228 0.2221345 0.2272729
Degrees of Freedom for the fit: 14 Residual Deg. of Freedom 2251
Global Deviance: 13612
AIC: 13640
SBC: 13720.1
Comparing AIC
df AIC
mix.IR.NO.3 8 13633.99
mix.IR.NO.4 11 13635.26
mix.IR.NO.5 14 13639.96
mix.IR.NO.2 5 13647.45
mix.IR.NO.1 2 13670.87
Showing the distributions
Mixture of Gaussians for IR.
We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013): Silverman's test (Silverman, 1981; Hall & York, 2001), Hartigans' dip test (Hartigan & Hartigan, 1985; package diptest), and the bimodality coefficient (BC; Freeman & Dale, 2013).
For each such test, we performed four randomisation procedures (PM1 to PM4) to obtain an estimate of the “specialness” of the observed unimodality estimate; for each new permuted dataset, we recomputed everything before estimating the unimodality of the permuted distribution.
The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).
Permutation of the texts’ SRs (PM1).
Permutation of the languages’ ID (PM2).
Permutation of the speakers’ average SRs (PM3).
Permutation of the languages’ average SRs with speaker adjustment (PM4).
| Scenario | Measure | Test | Observed estimate (p-value) | % more unimodal permutations |
|---|---|---|---|---|
| PM1 | SR | Silverman | - (0.019) | 68.2% |
| PM1 | SR | Dip | 0.005 (0.987) * | 100% |
| PM1 | SR | BC | 0.193 () * | 100% |
| PM1 | IR | Silverman | - (0.174) * | 85.8% |
| PM1 | IR | Dip | 0.004 (0.994) * | 55.3% |
| PM1 | IR | BC | 0.165 () * | 100% |
| PM2 | SR | Silverman | - (0.019) | 68% |
| PM2 | SR | Dip | 0.005 (0.987) * | 100% |
| PM2 | SR | BC | 0.193 () * | 100% |
| PM2 | IR | Silverman | - (0.174) * | 25.4% |
| PM2 | IR | Dip | 0.004 (0.994) * | 16.6% |
| PM2 | IR | BC | 0.165 () * | 97.8% |
| PM3 | SR | Silverman | - (0.019) | 100% |
| PM3 | SR | Dip | 0.005 (0.987) * | 0% |
| PM3 | SR | BC | 0.193 () * | 100% |
| PM3 | IR | Silverman | - (0.174) * | 86.4% |
| PM3 | IR | Dip | 0.004 (0.994) * | 5.9% |
| PM3 | IR | BC | 0.165 () * | 99.8% |
| PM4 | SR | Silverman | - (0.019) | 5.4% |
| PM4 | SR | Dip | 0.005 (0.987) * | 2.1% |
| PM4 | SR | BC | 0.193 () * | 76.6% |
| PM4 | IR | Silverman | - (0.174) * | 11.2% |
| PM4 | IR | Dip | 0.004 (0.994) * | 8.2% |
| PM4 | IR | BC | 0.165 () * | 98.7% |
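The last column of the table above compares the observed estimate with the estimates obtained on the permuted datasets. A minimal sketch of such a comparison in base R is given below; the helper name, its arguments and the toy numbers are hypothetical, and the direction of "more unimodal" is an assumption that depends on the statistic (for example, smaller values for the dip statistic and BC, larger values for a p-value).

```r
# Hypothetical helper: percentage of permuted estimates that are "more
# unimodal" than the observed one. Whether smaller or larger values mean
# "more unimodal" depends on the statistic, so it is passed explicitly.
pct_more_unimodal <- function(observed, permuted,
                              smaller_is_more_unimodal = TRUE) {
  more <- if (smaller_is_more_unimodal) permuted < observed else permuted > observed
  100 * mean(more)
}

# Toy illustration with made-up numbers (not taken from the actual analysis)
permuted_toy <- c(0.10, 0.20, 0.30, 0.40)
pct <- pct_more_unimodal(observed = 0.25, permuted = permuted_toy)
print(pct)
```

Passing the direction explicitly keeps the same helper usable for all three tests, at the cost of having to state the convention for each one.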
We compute various distances between languages (as implemented by the function distance() in package philentropy) with respect to the distributions of NS, SR and IR.
Comparing the distribution of pairwise distances between languages.
| m1 | m2 | d | mean1 | median1 | sd1 | mean2 | median2 | sd2 | p |
|---|---|---|---|---|---|---|---|---|---|
| IR | NS | Hellinger | 0.89 | 0.83 | 0.32 | 1.20 | 1.19 | 0.39 | 0.00 |
| IR | NS | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.29 | 0.26 | 0.16 | 0.00 |
| IR | NS | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.57 | 0.23 | 0.00 |
| IR | NS | Kullback-Leibler | 7.43 | 4.77 | 7.88 | 15.43 | 13.12 | 13.30 | 0.00 |
| IR | NS | Squared-Chi | 0.57 | 0.47 | 0.36 | 0.88 | 0.80 | 0.47 | 0.00 |
| IR | SR | Hellinger | 0.89 | 0.83 | 0.32 | 1.13 | 1.09 | 0.49 | 0.00 |
| IR | SR | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.28 | 0.24 | 0.20 | 0.00 |
| IR | SR | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.56 | 0.27 | 0.00 |
| IR | SR | Kullback-Leibler | 7.43 | 4.77 | 7.88 | 15.88 | 10.06 | 16.30 | 0.00 |
| IR | SR | Squared-Chi | 0.57 | 0.47 | 0.36 | 0.87 | 0.78 | 0.58 | 0.00 |
| NS | SR | Hellinger | 1.20 | 1.19 | 0.39 | 1.13 | 1.09 | 0.49 | 0.07 |
| NS | SR | Jensen-Shannon | 0.29 | 0.26 | 0.16 | 0.28 | 0.24 | 0.20 | 0.61 |
| NS | SR | Kolmogorov–Smirnov | 0.57 | 0.57 | 0.23 | 0.57 | 0.56 | 0.27 | 0.93 |
| NS | SR | Kullback-Leibler | 15.43 | 13.12 | 13.30 | 15.88 | 10.06 | 16.30 | 0.76 |
| NS | SR | Squared-Chi | 0.88 | 0.80 | 0.47 | 0.87 | 0.78 | 0.58 | 0.89 |
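To make the distances in the table above more concrete, the following base R sketch computes two of them for a pair of discrete distributions. It is only an illustration under stated assumptions: the binning, the toy values and the helper names are ours, and the formulas (Hellinger in its 0 to 2 form, Jensen-Shannon with natural logarithms) are what we assume philentropy uses rather than a copy of its implementation.

```r
# Turn raw values into a discrete probability distribution on shared bins
to_prob <- function(x, breaks) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  counts / sum(counts)
}

# Hellinger distance as 2 * sqrt(1 - sum(sqrt(p * q))), ranging from 0 to 2;
# max(0, .) guards against tiny negative values caused by rounding
hellinger <- function(p, q) 2 * sqrt(max(0, 1 - sum(sqrt(p * q))))

# Jensen-Shannon divergence with natural logarithms (range 0 to log(2))
jensen_shannon <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) {            # Kullback-Leibler, skipping empty bins
    keep <- a > 0
    sum(a[keep] * log(a[keep] / b[keep]))
  }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Toy data standing in for two languages (made-up values)
breaks <- seq(0, 10, by = 1)
lang_a <- c(4.2, 5.1, 5.5, 6.3, 6.8, 7.0)
lang_b <- c(6.1, 6.9, 7.4, 7.7, 8.2, 8.8)
p <- to_prob(lang_a, breaks)
q <- to_prob(lang_b, breaks)
print(c(hellinger = hellinger(p, q), jensen_shannon = jensen_shannon(p, q)))
```

Both measures are zero for identical distributions and reach their maximum (2 and log(2), respectively) when the two distributions do not overlap at all.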
Campione, E., & Véronis, J. (1998). A multilingual prosodic database. Proc. of the 5th International Conference on Spoken Language Processing (ICSLP’98), Sydney, Australia, 3163-3166.
Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.
Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.
Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13, 70-84.
Hartigan, P. M. (1985). Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C), 34, 320-325.
Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC. 4, pp. 599-602.
Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.
New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.
Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf
Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.
Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.
Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.
Váradi, T. (2002). The Hungarian National Corpus. In LREC.
Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.
R session info
This document was compiled on:
R version 3.4.4 (2018-03-15)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: compiler, grid, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base
other attached packages: broman(v.0.68-2), philentropy(v.0.3.0), diptest(v.0.75-7), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.2), sjstats(v.0.17.3), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-2), nlme(v.3.1-137), gamlss.dist(v.5.1-1), MASS(v.7.3-51.1), gamlss.data(v.5.1-0), lmerTest(v.3.1-0), lme4(v.1.1-20), Matrix(v.1.2-15), plyr(v.1.8.4), reshape2(v.1.4.3), ggplot2(v.3.1.0) and RhpcBLASctl(v.0.18-205)
loaded via a namespace (and not attached): RColorBrewer(v.1.1-2), numDeriv(v.2016.8-1), tools(v.3.4.4), TMB(v.1.7.15), backports(v.1.1.3), R6(v.2.4.0), sjlabelled(v.1.0.16), lazyeval(v.0.2.1), colorspace(v.1.4-0), withr(v.2.1.2), tidyselect(v.0.2.5), mnormt(v.1.5-5), emmeans(v.1.3.2), sandwich(v.2.5-0), labeling(v.0.3), scales(v.1.0.0), mvtnorm(v.1.0-8), psych(v.1.8.12), ggridges(v.0.5.1), stringr(v.1.4.0), digest(v.0.6.18), foreign(v.0.8-71), minqa(v.1.2.4), rmarkdown(v.1.11), stringdist(v.0.9.5.1), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), pwr(v.1.2-2), rlang(v.0.3.1), generics(v.0.0.2), zoo(v.1.8-4), dplyr(v.0.8.0.1), magrittr(v.1.5), modeltools(v.0.2-22), bayesplot(v.1.6.0), Rcpp(v.1.0.0), munsell(v.0.5.0), prediction(v.0.3.6.2), stringi(v.1.3.1), multcomp(v.1.4-8), yaml(v.2.2.0), snakecase(v.0.9.2), sjmisc(v.2.7.7), forcats(v.0.4.0), crayon(v.1.3.4), lattice(v.0.20-38), ggeffects(v.0.8.0), haven(v.2.1.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), estimability(v.1.3), codetools(v.0.2-16), stats4(v.3.4.4), glue(v.1.3.0), evaluate(v.0.13), data.table(v.1.12.0), modelr(v.0.1.4), nloptr(v.1.2.1), gtable(v.0.2.0), purrr(v.0.3.0), tidyr(v.0.8.2), assertthat(v.0.2.0), xfun(v.0.5), coin(v.1.2-2), xtable(v.1.8-3), broom(v.0.5.1), coda(v.0.19-2), survival(v.2.43-3), tibble(v.2.0.1), glmmTMB(v.0.2.3) and TH.data(v.1.0-10)
Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).